Week 11

Linear regression (continued)

Goal. Predict tips based on total bills using a function: $$f(\text{total bill}) = a\cdot (\text{total bill}) + b$$

We want to find values of the parameters $a$ and $b$ such that the value

$$C(a, b) = \sum_i \left(y^{(i)} - f(x^{(i)})\right)^2$$

is as small as possible. Note that $x^{(i)}$ and $y^{(i)}$ are known (they come from the data), while the parameters $a, b$ are unknown. The function $C(a, b)$ is called the cost function.
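As a minimal sketch, the cost function can be evaluated directly from its definition. The data below are made-up (total bill, tip) pairs chosen for illustration, not values from the tips data set, and the name `cost` is hypothetical.

```python
def cost(a, b, xs, ys):
    """Sum of squared residuals C(a, b) for the line f(x) = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical (total bill, tip) pairs, made up for illustration.
bills = [10.0, 20.0, 30.0]
tips = [2.0, 3.0, 4.0]

exact = cost(0.1, 1.0, bills, tips)  # the line 0.1*x + 1 passes through every point
worse = cost(0.2, 0.0, bills, tips)  # another line leaves nonzero residuals
```

Here `exact` is (numerically) zero while `worse` is positive, so the first pair of parameters is the better fit.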

From the course website:

Gradient descent

Gradient descent is a method of finding a minimum of a function $f$.

Algorithm:

  1. Choose some starting point $\bf x$ and a number $\alpha > 0 $ (the learning rate).
  2. Calculate the gradient $\nabla f({\bf x})$ of $f$ at the point $\bf x$. At the point $\bf x$ the function decreases fastest in the direction of the vector $-\nabla f({\bf x})$.
  3. Set the new value of $\bf x$ to be ${\bf x} -\alpha\nabla f(\bf x)$.
  4. Repeat steps 2-3 some number of times.
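The steps above can be sketched in plain Python. The function name `gradient_descent`, the target function, the starting point, and the learning rate are all illustrative assumptions.

```python
def gradient_descent(grad, start, alpha, steps):
    """Repeat x <- x - alpha * grad(x) a fixed number of times."""
    x = list(start)
    for _ in range(steps):
        g = grad(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# Illustrative target: f(x, y) = (x - 1)^2 + (y + 2)^2, with minimum at (1, -2).
def grad_f(p):
    x, y = p
    return [2 * (x - 1), 2 * (y + 2)]

minimum = gradient_descent(grad_f, start=[0.0, 0.0], alpha=0.1, steps=200)
```

With these choices `minimum` ends up very close to the point (1, -2).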

From the course website:

Example 1. Gradient descent for $p(x, y) = ax^2 + by^2$.

Note. This example illustrates why data features should be normalized before gradient descent is used to compute a regression. If the parameters $a$ and $b$ in the function above are close to each other, gradient descent converges quickly. Convergence is much slower if one of these parameters is much larger than the other.
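A rough way to see this effect is to count how many steps gradient descent needs for different values of $a$ and $b$. In the sketch below the learning rate is tied to the larger coefficient (an assumption made here so that the iteration stays stable), and the starting point and tolerance are arbitrary.

```python
def steps_to_converge(a, b, tol=1e-6, max_steps=100_000):
    """Count gradient descent steps on p(x, y) = a*x^2 + b*y^2, starting at (1, 1)."""
    alpha = 0.4 / max(a, b)      # assumed rule: the larger coefficient limits the rate
    x, y = 1.0, 1.0
    for step in range(1, max_steps + 1):
        x -= alpha * 2 * a * x   # dp/dx = 2*a*x
        y -= alpha * 2 * b * y   # dp/dy = 2*b*y
        if max(abs(x), abs(y)) < tol:
            return step
    return max_steps

balanced = steps_to_converge(1, 1)   # a and b equal
skewed = steps_to_converge(1, 50)    # b much larger than a
```

The count for the skewed case is far larger: the step size has to be small enough for the steep $y$ direction, which makes progress in the shallow $x$ direction slow.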

From the course website:

Himmelblau function:
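The Himmelblau function is usually defined as $f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2$. It has four local minima, all with value 0, and one of them is at $(3, 2)$. A minimal sketch of gradient descent on it, with an arbitrarily chosen starting point and learning rate:

```python
def himmelblau(x, y):
    return (x**2 + y - 11) ** 2 + (x + y**2 - 7) ** 2

def himmelblau_grad(x, y):
    """Partial derivatives of the Himmelblau function."""
    dx = 4 * x * (x**2 + y - 11) + 2 * (x + y**2 - 7)
    dy = 2 * (x**2 + y - 11) + 4 * y * (x + y**2 - 7)
    return dx, dy

x, y = 0.0, 0.0                  # illustrative starting point
for _ in range(2000):
    dx, dy = himmelblau_grad(x, y)
    x, y = x - 0.01 * dx, y - 0.01 * dy

final_value = himmelblau(x, y)   # close to 0 near a minimum
```

Which of the four minima is reached depends on the starting point.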

Rosenbrock function:
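The Rosenbrock function is commonly written as $f(x, y) = (1 - x)^2 + 100(y - x^2)^2$, with its minimum at $(1, 1)$ at the bottom of a long, curved valley. The sketch below uses an arbitrary starting point and a small learning rate; gradient descent makes steady but slow progress along the valley.

```python
def rosenbrock(x, y):
    return (1 - x) ** 2 + 100 * (y - x**2) ** 2

def rosenbrock_grad(x, y):
    """Partial derivatives of the Rosenbrock function."""
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return dx, dy

x, y = 0.0, 0.0                  # illustrative starting point
start_value = rosenbrock(x, y)
for _ in range(5000):
    dx, dy = rosenbrock_grad(x, y)
    x, y = x - 0.001 * dx, y - 0.001 * dy

end_value = rosenbrock(x, y)     # smaller than start_value
```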

Linear regression with sklearn

Linear regression with total_bill and size

Upshot: Adding size data does not seem to meaningfully improve the fit of the model.

Fit accuracy and $R^2$
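For reference, $R^2 = 1 - \frac{\sum_i (y^{(i)} - f(x^{(i)}))^2}{\sum_i (y^{(i)} - \bar{y})^2}$, where $\bar{y}$ is the mean of the observed values. This is the quantity that the `score` method of sklearn's `LinearRegression` returns. Below is a plain-Python sketch on made-up data; the helper names `fit_line` and `r_squared` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def r_squared(ys, preds):
    """1 - (residual sum of squares) / (total sum of squares)."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1.0, 2.0, 3.0, 4.0]        # made-up data
ys = [2.0, 4.0, 5.0, 8.0]
a, b = fit_line(xs, ys)
r2 = r_squared(ys, [a * x + b for x in xs])
```

A value of `r2` close to 1 means the line explains most of the variation in `ys`.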

Linear regression with total_bill and day:

Create dummy variables for the day column:
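Dummy variables replace a categorical column by one 0/1 column per category. With a DataFrame this is typically done with `pandas.get_dummies`; the plain-Python sketch below only illustrates the idea, and the helper `make_dummies` is hypothetical.

```python
def make_dummies(values, prefix="day"):
    """One 0/1 indicator per category, for each value in the column."""
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]

days = ["Thur", "Fri", "Sat", "Sun", "Sat"]   # sample values of the day column
rows = make_dummies(days)
first_row = rows[0]   # indicators for "Thur"
```

Each row contains exactly one 1, in the column of its own category.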

Upshot: The day data does not improve the model.

Regression with k-NN
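In k-NN regression the prediction for a new point is the average of the target values of its $k$ nearest neighbors in the training data. A minimal one-feature sketch on made-up data (the function name `knn_predict` is hypothetical):

```python
def knn_predict(x_new, xs, ys, k=3):
    """Average the targets of the k training points closest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

bills = [10.0, 20.0, 30.0, 40.0, 50.0]   # made-up total bills
tips = [1.5, 3.0, 4.5, 6.0, 7.5]         # made-up tips

prediction = knn_predict(22.0, bills, tips, k=3)  # neighbors: 20, 30, 10
```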

String operations

Regular expressions

Raw strings
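In a raw string (prefix `r`) backslashes are not treated as escape characters, which is convenient for regular expressions since they use backslashes heavily. A short illustration:

```python
import re

plain = "\\d+"   # ordinary string: the backslash has to be escaped
raw = r"\d+"     # raw string: the backslash is kept as typed
same_pattern = plain == raw

# "\t" is a single tab character, while r"\t" is a backslash followed by "t".
lengths = (len("\t"), len(r"\t"))

numbers = re.findall(raw, "3 apples and 12 pears")
```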

From the course website:

Alternative:
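The `|` operator matches either the pattern on its left or the one on its right. A small illustrative example:

```python
import re

# Every occurrence of "cat" or "dog", including "cat" inside "catalog".
pets = re.findall(r"cat|dog", "cat, dog, bird, catalog")
```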

Grouping patterns:
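Parentheses group a part of a pattern so that an operator applies to the whole group. The example strings below are made up for illustration.

```python
import re

grouped = re.fullmatch(r"(ha)+", "hahaha")   # "+" repeats the group "ha"
ungrouped = re.fullmatch(r"ha+", "hahaha")   # "+" repeats only "a": no match
stretched = re.fullmatch(r"ha+", "haaa")     # matches "h" followed by several "a"
```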

Character classes:
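Square brackets match any single character from a set, `[^...]` matches any character not in the set, and shorthands such as `\d`, `\w`, `\s` stand for digits, word characters, and whitespace. A few illustrative cases:

```python
import re

vowels = re.findall(r"[aeiou]", "regular")
words = re.findall(r"\w+", "room 101, bldg B")
non_digits = re.findall(r"[^\d\s]+", "room 101 b")   # runs with no digits or whitespace
```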

Note how newline characters are matched:
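By default `.` matches any character except a newline, while `\s` does match a newline. The `re.DOTALL` flag makes `.` match newlines as well:

```python
import re

text = "a\nb"
default = re.findall(r"a.b", text)                  # "." skips the newline: no match
dotall = re.findall(r"a.b", text, flags=re.DOTALL)  # "." now matches "\n"
whitespace = re.findall(r"a\sb", text)              # "\s" matches "\n"
```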

Repetitions:
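The quantifiers `?`, `*`, `+`, and `{m,n}` mean, respectively: zero or one, zero or more, one or more, and between m and n repetitions of the preceding item. Some illustrative patterns:

```python
import re

optional = re.findall(r"colou?r", "color colour")
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")   # takes "444" out of "4444"
```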

Greedy vs non-greedy matching:
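Quantifiers are greedy by default: they match as much text as possible. Adding `?` after a quantifier makes it match as little as possible. An illustration on a made-up string:

```python
import re

text = "<b>one</b> and <b>two</b>"
greedy = re.findall(r"<b>.*</b>", text)    # one long match spanning both tags
lazy = re.findall(r"<b>.*?</b>", text)     # two short matches
```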

Flags:
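Flags modify how a pattern is interpreted. Two commonly used ones are `re.IGNORECASE` (case-insensitive matching) and `re.MULTILINE` (`^` and `$` match at every line, not only at the ends of the string). A short sketch:

```python
import re

any_case = re.findall(r"python", "Python PYTHON python", flags=re.IGNORECASE)

lines = "one two\nthree four"
first_only = re.findall(r"^\w+", lines)                     # start of string only
each_line = re.findall(r"^\w+", lines, flags=re.MULTILINE)  # start of every line
```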

Anchors:
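Anchors match positions rather than characters: `^` the start of the string, `$` the end, and `\b` a word boundary. Some illustrative cases:

```python
import re

at_start = re.findall(r"^\d+", "42 is not 43")
at_end = re.findall(r"\d+$", "42 is not 43")
whole_words = re.findall(r"\bcat\b", "cat concat cat.")   # skips "cat" inside "concat"
```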

Example. Match dollar amounts:
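One possible pattern, not necessarily the one used in the lecture: a literal dollar sign, one or more digits, and an optional part with a decimal point and two digits. The sample text is made up.

```python
import re

dollar_pattern = r"\$\d+(?:\.\d{2})?"   # "\$" is a literal dollar sign
text = "Lunch was $12.50, coffee $3 and tax 1.20"
amounts = re.findall(dollar_pattern, text)
```

The number `1.20` is skipped because it is not preceded by a dollar sign.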

Example. Match phone numbers:
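A sketch of one possible pattern, assuming 10-digit US-style numbers with optional parentheses around the area code and `-`, `.`, or whitespace as separators. Both the pattern and the sample numbers are illustrative.

```python
import re

phone_pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
text = "Call 716-555-0100 or (716) 555 0199."
phones = re.findall(phone_pattern, text)
```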

Example. Match emails:
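A deliberately simplified pattern is sketched below: a local part, `@`, and a domain with at least one dot. Real email addresses can be more complicated than this pattern allows, and the addresses in the sample text are made up.

```python
import re

email_pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
text = "Write to ann@example.com or bob.smith@mail.example.org."
emails = re.findall(email_pattern, text)
```

The final period of the sentence is not included in the second match, since every dot in the domain must be followed by at least one more character.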

Example. Match dates:
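One possible pattern, assuming dates written as month/day/year with a four-digit year. It checks only the shape of the text, not whether the date is valid.

```python
import re

date_pattern = r"\b\d{1,2}/\d{1,2}/\d{4}\b"
text = "Due 11/14/2023, posted 11/7/2023."
dates = re.findall(date_pattern, text)
```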

re.findall()
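`re.findall()` returns a list of all non-overlapping matches of a pattern, while `re.search()` returns a match object for the first match only (or `None`). A small comparison on made-up text:

```python
import re

text = "tips: 1.01, 3.50, 2.00"
first = re.search(r"\d+\.\d+", text).group()   # first match only
every = re.findall(r"\d+\.\d+", text)          # list of all matches
```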

Match groups:
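Parentheses also capture the text they match. A match object gives access to the groups by number, and `re.findall()` returns tuples of groups when the pattern contains more than one group. An illustrative example:

```python
import re

pattern = r"(\d{4})-(\d{2})-(\d{2})"
m = re.search(pattern, "posted on 2023-11-14")
whole = m.group(0)    # the entire match
year = m.group(1)     # first group
parts = m.groups()    # all groups as a tuple

pairs = re.findall(pattern, "2023-11-07 and 2023-11-14")
```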

Non-capturing match groups:
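A group written as `(?:...)` groups a part of the pattern without capturing it, which keeps it out of the results of `re.findall()`. A comparison on a made-up sentence:

```python
import re

text = "Mr. Smith and Ms. Jones"
capturing = re.findall(r"(Mr|Ms)\. (\w+)", text)       # both groups are returned
non_capturing = re.findall(r"(?:Mr|Ms)\. (\w+)", text) # only the name is returned
```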

re.sub()
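`re.sub(pattern, repl, string)` replaces every match of `pattern` in `string` by `repl`. The replacement can refer to captured groups as `\1`, `\2`, and so on. Two illustrative uses:

```python
import re

# Collapse any run of whitespace into a single space.
squeezed = re.sub(r"\s+", " ", "a   b \n c")

# Rearrange month/day/year into year-month-day using group references.
iso_date = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\1-\2", "11/14/2023")
```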

Regex exercises